In [1]:
# dimension reduction - all together
###
###
###
In [2]:
# based on the example by Prof. Katarzyna Kopczewska
In [3]:
CIA<-read.csv("#USL 09 CIAdata.csv", header=TRUE, sep=";", dec=",")
dim(CIA)
summary(CIA)
  1. 222
  2. 19
           COUNTRY                  CONTINENT   AREA..SQ.KM.     
 AFGHANISTAN   :  1   AFRICA             :52   Min.   :       2  
 ALBANIA       :  1   EUROPE             :49   1st Qu.:    4407  
 ALGERIA       :  1   CENTRAL AMERICA    :31   Median :   81234  
 AMERICAN SAMOA:  1   AUSTRALIA & OCEANIA:20   Mean   :  600884  
 ANDORRA       :  1   MIDDLE EAST        :19   3rd Qu.:  447188  
 ANGOLA        :  1   EAST ASIA          :18   Max.   :17098242  
 (Other)       :216   (Other)            :33                     
   POPULATION        X.BIRTHS.1000.POPULATION. LIFE.EXPECTANCY..YEARS.
 Min.   :2.000e+00   Min.   : 6.40             Min.   :52.80          
 1st Qu.:6.096e+05   1st Qu.:11.40             1st Qu.:69.28          
 Median :5.917e+06   Median :15.40             Median :75.60          
 Mean   :3.377e+07   Mean   :18.31             Mean   :73.94          
 3rd Qu.:2.128e+07   3rd Qu.:23.00             3rd Qu.:79.20          
 Max.   :1.394e+09   Max.   :47.50             Max.   :89.30          
                     NA's   :1                 NA's   :2              
 ADULT.OBESITY.... GDP...PER.CAPITA..PPP. UNEMPLOYMENT.RATE....
 Min.   : 2.10     Min.   :   700         Min.   : 0.30        
 1st Qu.: 9.60     1st Qu.:  5750         1st Qu.: 3.77        
 Median :20.80     Median : 15000         Median : 6.90        
 Mean   :20.01     Mean   : 24045         Mean   :10.20        
 3rd Qu.:25.60     3rd Qu.: 35550         3rd Qu.:11.85        
 Max.   :61.00     Max.   :139100         Max.   :77.00        
 NA's   :37        NA's   :3              NA's   :11           
 ELECTRICITY.CONSUMPTION..KWH. MOBILE.PHONES       INTERNET.USERS     
 Min.   :0.000e+00             Min.   :1.810e+02   Min.   :        3  
 1st Qu.:6.300e+08             1st Qu.:7.927e+05   1st Qu.:   248004  
 Median :6.798e+09             Median :6.551e+06   Median :  2395886  
 Mean   :1.007e+11             Mean   :3.781e+07   Mean   : 17288473  
 3rd Qu.:4.067e+10             3rd Qu.:2.177e+07   3rd Qu.:  9251773  
 Max.   :5.564e+12             Max.   :1.673e+09   Max.   :751886119  
 NA's   :11                    NA's   :11          NA's   :5          
 RAILWAYS..KM.      MILITARY.EXPENDITURES....OF.GDP.
 Min.   :     9.0   Min.   :0.200                   
 1st Qu.:   568.5   1st Qu.:1.000                   
 Median :  1673.0   Median :1.500                   
 Mean   :  7990.1   Mean   :1.887                   
 3rd Qu.:  4604.5   3rd Qu.:2.280                   
 Max.   :293564.0   Max.   :8.800                   
 NA's   :91         NA's   :69                      
 POPULATION.DENSITY..PERS.SQ.KM. ELECTRICITY.CONSUMPTION.PER.CAPITA
 Min.   :    0.01                Min.   :      0.0                 
 1st Qu.:   29.82                1st Qu.:    521.4                 
 Median :   84.43                Median :   2134.4                 
 Mean   :  370.28                Mean   :  33980.6                 
 3rd Qu.:  209.98                3rd Qu.:   5577.9                 
 Max.   :21789.29                Max.   :2877922.1                 
                                 NA's   :11                        
 MOBILE.PHONES....  INTERNET.USERS.... RAILWAY.DENSITY  
 Min.   :   0.000   Min.   :   0.00    Min.   :0.00000  
 1st Qu.:   0.845   1st Qu.:   0.25    1st Qu.:0.00305  
 Median :   1.100   Median :   0.59    Median :0.00730  
 Mean   :  14.448   Mean   :  11.42    Mean   :0.02133  
 3rd Qu.:   1.330   3rd Qu.:   0.80    3rd Qu.:0.03010  
 Max.   :1268.120   Max.   :1019.26    Max.   :0.19160  
 NA's   :11         NA's   :5          NA's   :91       
In [4]:
# eliminating columns with names
rownames(CIA)<-CIA[,1]
countries<-CIA[,1]
continents<-CIA[,2]
CIA<-CIA[,c(-1,-2)]
In [5]:
# new re-scaled variables
CIA$AREA.100<-CIA$AREA..SQ.KM./mean(CIA$AREA..SQ.KM., na.rm=TRUE)
CIA$POPUL.100<-CIA$POPULATION/mean(CIA$POPULATION, na.rm=TRUE)
CIA$GDP.PC.PPP.100<-CIA$GDP...PER.CAPITA..PPP./mean(CIA$GDP...PER.CAPITA..PPP., na.rm=TRUE)
In [6]:
names(CIA)
  1. 'AREA..SQ.KM.'
  2. 'POPULATION'
  3. 'X.BIRTHS.1000.POPULATION.'
  4. 'LIFE.EXPECTANCY..YEARS.'
  5. 'ADULT.OBESITY....'
  6. 'GDP...PER.CAPITA..PPP.'
  7. 'UNEMPLOYMENT.RATE....'
  8. 'ELECTRICITY.CONSUMPTION..KWH.'
  9. 'MOBILE.PHONES'
  10. 'INTERNET.USERS'
  11. 'RAILWAYS..KM.'
  12. 'MILITARY.EXPENDITURES....OF.GDP.'
  13. 'POPULATION.DENSITY..PERS.SQ.KM.'
  14. 'ELECTRICITY.CONSUMPTION.PER.CAPITA'
  15. 'MOBILE.PHONES....'
  16. 'INTERNET.USERS....'
  17. 'RAILWAY.DENSITY'
  18. 'AREA.100'
  19. 'POPUL.100'
  20. 'GDP.PC.PPP.100'
In [7]:
# new names (to eliminate dots)
colnames(CIA)<-c("AREA.sqkm", "POPULATION", "BIRTHS.1000.POPUL", "LIFE.EXP.YEARS",  "ADULT.OBESITY", "GDP.PC.PPP", "UNEMPL.RATE", "ELECTRICITY.CONSUMPTION.KWH", "MOBILE.PHONES", "INTERNET.USERS", "RAILWAYS.KM", "MILITARY.EXP.TO.GDP", "POPUL.DENSITY", "ELECTR.CONS.PC", "MOBILE.PH.PC", "INTERNET.USERS.PC", "RAILWAY.DENSITY", "AREA.100", "POPUL.100", "GDP.PC.PPP.100")
In [8]:
# keeping only scaled variables
CIA<-CIA[,c(-1, -2, -6, -8, -9, -10, -11)]
In [9]:
# imputation to eliminate many missing values
install.packages("mice")
library(mice)
CIA.imput<-mice(CIA)
#CIA.new<-mice(CIA, m=5, maxit=50, method="pmm")
CIA.new<-complete(CIA.imput)
summary(CIA.new)
x<-which(is.na(CIA.new$MOBILE.PH.PC==TRUE))
CIA.new$MOBILE.PH.PC[x]<-0
x<-which(is.na(CIA.new$GDP.PC.PPP.100==TRUE))
CIA.new$GDP.PC.PPP.100[x]<-0
summary(CIA.new)
Installing package into ‘/srv/rlibs’
(as ‘lib’ is unspecified)

Loading required package: lattice


Attaching package: ‘mice’


The following objects are masked from ‘package:base’:

    cbind, rbind


 iter imp variable
  1   1  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  1   2  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  1   3  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  1   4  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  1   5  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  2   1  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  2   2  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  2   3  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  2   4  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  2   5  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  3   1  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  3   2  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  3   3  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  3   4  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  3   5  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  4   1  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  4   2  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  4   3  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  4   4  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  4   5  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  5   1  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  5   2  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  5   3  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  5   4  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
  5   5  BIRTHS.1000.POPUL  LIFE.EXP.YEARS  ADULT.OBESITY  UNEMPL.RATE  MILITARY.EXP.TO.GDP  ELECTR.CONS.PC  INTERNET.USERS.PC  RAILWAY.DENSITY  GDP.PC.PPP.100
Warning message:
“Number of logged events: 1”
 BIRTHS.1000.POPUL LIFE.EXP.YEARS  ADULT.OBESITY     UNEMPL.RATE   
 Min.   : 6.40     Min.   :52.80   Min.   : 2.100   Min.   : 0.30  
 1st Qu.:11.43     1st Qu.:69.22   1st Qu.: 9.925   1st Qu.: 3.83  
 Median :15.40     Median :75.60   Median :21.250   Median : 6.90  
 Mean   :18.33     Mean   :73.95   Mean   :20.475   Mean   :10.31  
 3rd Qu.:23.00     3rd Qu.:79.20   3rd Qu.:25.775   3rd Qu.:11.98  
 Max.   :47.50     Max.   :89.30   Max.   :61.000   Max.   :77.00  
                                                                   
 MILITARY.EXP.TO.GDP POPUL.DENSITY      ELECTR.CONS.PC       MOBILE.PH.PC     
 Min.   :0.200       Min.   :    0.01   Min.   :      0.0   Min.   :   0.000  
 1st Qu.:1.000       1st Qu.:   29.82   1st Qu.:    554.2   1st Qu.:   0.845  
 Median :1.500       Median :   84.43   Median :   2283.9   Median :   1.100  
 Mean   :2.024       Mean   :  370.28   Mean   :  45014.4   Mean   :  14.448  
 3rd Qu.:2.375       3rd Qu.:  209.98   3rd Qu.:   5811.4   3rd Qu.:   1.330  
 Max.   :8.800       Max.   :21789.29   Max.   :2877922.1   Max.   :1268.120  
                                                            NA's   :11        
 INTERNET.USERS.PC   RAILWAY.DENSITY       AREA.100           POPUL.100       
 Min.   :   0.0000   Min.   :0.000000   Min.   : 0.000003   Min.   : 0.00000  
 1st Qu.:   0.2425   1st Qu.:0.002975   1st Qu.: 0.007335   1st Qu.: 0.01805  
 Median :   0.5900   Median :0.008800   Median : 0.135190   Median : 0.17523  
 Mean   :  11.1705   Mean   :0.024479   Mean   : 1.000000   Mean   : 1.00000  
 3rd Qu.:   0.8000   3rd Qu.:0.034000   3rd Qu.: 0.744216   3rd Qu.: 0.63006  
 Max.   :1019.2600   Max.   :0.191600   Max.   :28.455142   Max.   :41.28128  
                                                                              
 GDP.PC.PPP.100   
 Min.   :0.02911  
 1st Qu.:0.23810  
 Median :0.62592  
 Mean   :1.01253  
 3rd Qu.:1.50553  
 Max.   :5.78505  
                  
 BIRTHS.1000.POPUL LIFE.EXP.YEARS  ADULT.OBESITY     UNEMPL.RATE   
 Min.   : 6.40     Min.   :52.80   Min.   : 2.100   Min.   : 0.30  
 1st Qu.:11.43     1st Qu.:69.22   1st Qu.: 9.925   1st Qu.: 3.83  
 Median :15.40     Median :75.60   Median :21.250   Median : 6.90  
 Mean   :18.33     Mean   :73.95   Mean   :20.475   Mean   :10.31  
 3rd Qu.:23.00     3rd Qu.:79.20   3rd Qu.:25.775   3rd Qu.:11.98  
 Max.   :47.50     Max.   :89.30   Max.   :61.000   Max.   :77.00  
 MILITARY.EXP.TO.GDP POPUL.DENSITY      ELECTR.CONS.PC       MOBILE.PH.PC     
 Min.   :0.200       Min.   :    0.01   Min.   :      0.0   Min.   :   0.000  
 1st Qu.:1.000       1st Qu.:   29.82   1st Qu.:    554.2   1st Qu.:   0.760  
 Median :1.500       Median :   84.43   Median :   2283.9   Median :   1.075  
 Mean   :2.024       Mean   :  370.28   Mean   :  45014.4   Mean   :  13.732  
 3rd Qu.:2.375       3rd Qu.:  209.98   3rd Qu.:   5811.4   3rd Qu.:   1.310  
 Max.   :8.800       Max.   :21789.29   Max.   :2877922.1   Max.   :1268.120  
 INTERNET.USERS.PC   RAILWAY.DENSITY       AREA.100           POPUL.100       
 Min.   :   0.0000   Min.   :0.000000   Min.   : 0.000003   Min.   : 0.00000  
 1st Qu.:   0.2425   1st Qu.:0.002975   1st Qu.: 0.007335   1st Qu.: 0.01805  
 Median :   0.5900   Median :0.008800   Median : 0.135190   Median : 0.17523  
 Mean   :  11.1705   Mean   :0.024479   Mean   : 1.000000   Mean   : 1.00000  
 3rd Qu.:   0.8000   3rd Qu.:0.034000   3rd Qu.: 0.744216   3rd Qu.: 0.63006  
 Max.   :1019.2600   Max.   :0.191600   Max.   :28.455142   Max.   :41.28128  
 GDP.PC.PPP.100   
 Min.   :0.02911  
 1st Qu.:0.23810  
 Median :0.62592  
 Mean   :1.01253  
 3rd Qu.:1.50553  
 Max.   :5.78505  
In [10]:
# k-means
install.packages("factoextra")
library(factoextra)

# for features
#km.f<-eclust(t(CIA), "kmeans", hc_metric="euclidean",k=5) # does not work due to NAs
km.f<-eclust(t(CIA.new), "kmeans", hc_metric="euclidean",k=5)

fviz_cluster(km.f, main="kmeans / Euclidean", ylim=c(-15, 5), xlim=c(-15, 55), labelsize=10, repel=TRUE)
fviz_silhouette(km.f)
Installing package into ‘/srv/rlibs’
(as ‘lib’ is unspecified)

Loading required package: ggplot2

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang

Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

  cluster size ave.sil.width
1       1    2          0.71
2       2    1          0.00
3       3    8          0.80
4       4    1          0.00
5       5    1          0.00
In [11]:
# for countries
km.c<-eclust(CIA.new, "kmeans", hc_metric="euclidean",k=8)
fviz_cluster(km.c, main="kmeans / Euclidean")
fviz_silhouette(km.c)
  cluster size ave.sil.width
1       1    8          0.56
2       2    1          0.00
3       3    2          0.27
4       4   40          0.35
5       5  111          0.69
6       6   54          0.58
7       7    1          0.00
8       8    5          0.59
In [12]:
# we select medoids – most middle representants of world’regions
vec<-unique(continents)
meds<-matrix(0, ncol=1, nrow=10)
for(i in 1:10){
vec[i]
x<-which(continents==vec[i])
sub<-CIA.new[x,]
ppam<-eclust(sub, "pam", k=1) # factoextra::
meds[i,1]<-rownames(ppam$medoids)}
meds
rownames(meds)<-vec
conti.rep.PAM<-CIA.new[meds,]
rownames(conti.rep.PAM)<-vec
A matrix: 10 × 1 of type chr
187
180
47
217
8
27
202
210
215
203
In [13]:
# we use k-means – artificial parameters of centroid observation of world’ regions
vec<-unique(continents)
meds<-matrix(0, ncol=13, nrow=10)
for(i in 1:10){
vec[i]
x<-which(continents==vec[i])
sub<-CIA.new[x,]
kkm<-eclust(sub, "kmeans", k=1) # factoextra::
meds[i,]<-kkm$centers}
meds
rownames(meds)<-vec
conti.rep.km<-meds
A matrix: 10 × 13 of type dbl
20.6250070.46250 5.73750 7.0912502.037500 467.36250 774.6350 1.0250000 0.31750000.0255250001.06832826.712128760.3254349
10.2591879.7489822.37551 6.9865311.720204 271.50122 93441.213540.916938837.14693880.0448918370.20390800.367905221.7967350
30.7211564.7288512.0615415.3973081.605192 99.84519 585.0156 0.8394231 0.26442310.0082750000.88376480.700600840.2611317
17.9950075.1100038.6350012.6590002.736000 113.15200244300.396042.472000028.15800000.0155050000.67393240.049932120.6192620
14.3903277.0258123.0161310.2967741.405806 238.30161 4652.1774 2.0793548 1.03967740.0323161290.03982390.085463900.9307914
15.5076975.6846223.52308 7.4600001.541538 21.47692 2349.7169 1.1307692 0.55692310.0133538462.27159120.969293380.7985111
17.8789576.1105327.1473715.1731584.634737 529.33053 5545.1295 1.2410526 0.76894740.0259473680.56691690.557784631.4643749
12.0333379.4166729.45000 6.3083332.173333 43.01667 21328.2183 0.8783333 2.14500000.0204666676.64269672.463092692.1868947
15.6777875.1555610.07222 3.4183331.8305562196.41889 3608.0256 1.4605556 0.57611110.0255500001.52862403.663809361.3560364
17.2000071.8000018.35000 5.1633332.150000 31.06000 3152.1283 1.3433333 0.48000000.0064833335.85323471.045083400.5974971
In [14]:
# for countries - representatives
km.c<-eclust(conti.rep.PAM, "kmeans", hc_metric="euclidean",k=5)
fviz_cluster(km.c, main="PAM rep / Euclidean")
fviz_silhouette(km.c)

km.c<-eclust(conti.rep.km, "kmeans", hc_metric="euclidean",k=5)
fviz_cluster(km.c, main="kmeans rep / Euclidean")
fviz_silhouette(km.c)
  cluster size ave.sil.width
1       1    2          0.60
2       2    4          0.69
3       3    2          0.74
4       4    1          0.00
5       5    1          0.00
  cluster size ave.sil.width
1       1    3          0.55
2       2    4          0.32
3       3    1          0.00
4       4    1          0.00
5       5    1          0.00
In [15]:
library(factoextra)
fviz_nbclust(t(CIA.new), FUNcluster=cluster::pam, method="gap_stat")
fviz_nbclust(CIA.new, FUNcluster=cluster::pam, method="gap_stat")
In [16]:
# boxplots in groups
install.packages("flexclust")
library(flexclust)

km1<-kmeans(CIA.new[,], 4) # stats::
groupBWplot(CIA.new[,], km1$cluster, alpha=0.05) #flexclust::

km1<-kmeans(CIA.new[,c(-6, -7)], 4) # stats::
groupBWplot(CIA.new[,c(-6, -7)], km1$cluster, alpha=0.05) #flexclust::
Installing package into ‘/srv/rlibs’
(as ‘lib’ is unspecified)

Loading required package: grid

Loading required package: modeltools

Loading required package: stats4

In [17]:
km1<-kmeans(CIA.new[,c(-6, -7, -8, -9)], 4) # stats::
groupBWplot(CIA.new[,c(-6, -7, -8, -9)], km1$cluster, alpha=0.05) #flexclust::

d1<-cclust(CIA.new[,c(-6, -7, -8, -9)], 4, dist="euclidean") #flexclust::
stripes(d1) #flexclust::
In [18]:
install.packages("psych")
library(psych)

km1<-kmeans(CIA.new[,], 4) # stats::
describeBy(CIA.new, km1$cluster) # psych::
Installing package into ‘/srv/rlibs’
(as ‘lib’ is unspecified)


Attaching package: ‘psych’


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


 Descriptive statistics by group 
group: 1
                    vars n     mean       sd   median  trimmed      mad
BIRTHS.1000.POPUL      1 2    12.25     1.48    12.25    12.25     1.56
LIFE.EXP.YEARS         2 2    82.50     1.13    82.50    82.50     1.19
ADULT.OBESITY          3 2    23.40     2.12    23.40    23.40     2.22
UNEMPL.RATE            4 2     5.31     2.39     5.31     5.31     2.51
MILITARY.EXP.TO.GDP    5 2     2.50     3.11     2.50     2.50     3.26
POPUL.DENSITY          6 2    68.14    91.54    68.14    68.14    95.97
ELECTR.CONS.PC         7 2 67329.72 23930.12 67329.72 67329.72 25087.30
MOBILE.PH.PC           8 2     0.57     0.64     0.57     0.57     0.67
INTERNET.USERS.PC      9 2     5.37     6.22     5.37     5.37     6.52
RAILWAY.DENSITY       10 2     0.02     0.02     0.02     0.02     0.02
AREA.100              11 2     0.09     0.12     0.09     0.09     0.13
POPUL.100             12 2     0.01     0.01     0.01     0.01     0.01
GDP.PC.PPP.100        13 2     3.15     1.39     3.15     3.15     1.46
                         min      max    range skew kurtosis       se
BIRTHS.1000.POPUL      11.20    13.30     2.10    0    -2.75     1.05
LIFE.EXP.YEARS         81.70    83.30     1.60    0    -2.75     0.80
ADULT.OBESITY          21.90    24.90     3.00    0    -2.75     1.50
UNEMPL.RATE             3.62     7.00     3.38    0    -2.75     1.69
MILITARY.EXP.TO.GDP     0.30     4.70     4.40    0    -2.75     2.20
POPUL.DENSITY           3.41   132.87   129.46    0    -2.75    64.73
ELECTR.CONS.PC      50408.57 84250.87 33842.30    0    -2.75 16921.15
MOBILE.PH.PC            0.12     1.03     0.91    0    -2.75     0.45
INTERNET.USERS.PC       0.97     9.76     8.79    0    -2.75     4.39
RAILWAY.DENSITY         0.01     0.03     0.02    0    -2.75     0.01
AREA.100                0.00     0.17     0.17    0    -2.75     0.09
POPUL.100               0.00     0.01     0.01    0    -2.75     0.01
GDP.PC.PPP.100          2.17     4.13     1.96    0    -2.75     0.98
------------------------------------------------------------ 
group: 2
                    vars n       mean        sd     median    trimmed    mad
BIRTHS.1000.POPUL      1 5      16.64      9.12      21.90      16.64   3.85
LIFE.EXP.YEARS         2 5      75.36     10.10      68.40      75.36   0.74
ADULT.OBESITY          3 5      39.56     21.25      50.00      39.56  16.31
UNEMPL.RATE            4 5       9.52      8.44       6.90       9.52   7.26
MILITARY.EXP.TO.GDP    5 5       2.42      1.57       1.60       2.42   0.59
POPUL.DENSITY          6 5      91.28    193.01       0.52      91.28   0.76
ELECTR.CONS.PC         7 5 1818402.60 652836.19 1395000.00 1818402.60   0.00
MOBILE.PH.PC           8 5     558.31    552.38     685.26     558.31 864.15
INTERNET.USERS.PC      9 5     468.21    442.33     502.18     468.21 663.73
RAILWAY.DENSITY       10 5       0.00      0.00       0.00       0.00   0.00
AREA.100              11 5       0.00      0.00       0.00       0.00   0.00
POPUL.100             12 5       0.00      0.00       0.00       0.00   0.00
GDP.PC.PPP.100        13 5       1.56      1.98       0.51       1.56   0.52
                          min        max      range  skew kurtosis        se
BIRTHS.1000.POPUL   6.400e+00      24.50      18.10 -0.27    -2.24      4.08
LIFE.EXP.YEARS      6.790e+01      89.30      21.40  0.40    -2.05      4.52
ADULT.OBESITY       9.600e+00      61.00      51.40 -0.35    -1.93      9.50
UNEMPL.RATE         2.000e+00      23.00      21.00  0.61    -1.53      3.78
MILITARY.EXP.TO.GDP 1.200e+00       4.70       3.50  0.44    -1.92      0.70
POPUL.DENSITY       1.000e-02     436.23     436.22  1.07    -0.93     86.32
ELECTR.CONS.PC      1.395e+06 2877922.08 1482922.08  0.70    -1.50 291957.22
MOBILE.PH.PC        0.000e+00    1268.12    1268.12  0.04    -2.02    247.03
INTERNET.USERS.PC   4.800e-01    1019.26    1018.78  0.05    -2.07    197.81
RAILWAY.DENSITY     0.000e+00       0.01       0.00  0.09    -2.20      0.00
AREA.100            0.000e+00       0.00       0.00  0.55    -1.76      0.00
POPUL.100           0.000e+00       0.00       0.00  1.07    -0.92      0.00
GDP.PC.PPP.100      1.600e-01       4.81       4.65  0.73    -1.43      0.88
------------------------------------------------------------ 
group: 3
                    vars   n    mean      sd  median trimmed     mad   min
BIRTHS.1000.POPUL      1 165   20.40    9.54   17.80   19.46    8.60  7.80
LIFE.EXP.YEARS         2 165   72.00    7.09   74.20   72.58    6.23 52.80
ADULT.OBESITY          3 165   18.42   10.09   20.20   17.87   11.12  2.10
UNEMPL.RATE            4 165   11.39   11.16    7.86    9.50    6.26  0.30
MILITARY.EXP.TO.GDP    5 165    1.78    1.26    1.50    1.60    0.80  0.20
POPUL.DENSITY          6 165  173.06  454.66   79.99   98.46   79.91  0.26
ELECTR.CONS.PC         7 165 1739.76 1631.48 1307.77 1559.35 1721.11  0.00
MOBILE.PH.PC           8 165    1.09    1.38    1.02    1.00    0.40  0.00
INTERNET.USERS.PC      9 165    0.49    0.64    0.46    0.44    0.37  0.00
RAILWAY.DENSITY       10 165    0.02    0.03    0.01    0.01    0.01  0.00
AREA.100              11 165    0.81    1.86    0.20    0.45    0.29  0.00
POPUL.100             12 165    1.16    4.53    0.21    0.41    0.31  0.00
GDP.PC.PPP.100        13 165    0.62    0.64    0.45    0.51    0.45  0.03
                        max   range  skew kurtosis     se
BIRTHS.1000.POPUL     47.50   39.70  0.80    -0.33   0.74
LIFE.EXP.YEARS        83.50   30.70 -0.75    -0.10   0.55
ADULT.OBESITY         55.90   53.80  0.54     0.51   0.79
UNEMPL.RATE           77.00   76.70  2.28     7.44   0.87
MILITARY.EXP.TO.GDP    8.80    8.60  1.93     5.84   0.10
POPUL.DENSITY       5328.39 5328.13  9.15    97.96  35.40
ELECTR.CONS.PC      5625.78 5625.78  0.73    -0.71 127.01
MOBILE.PH.PC          17.87   17.87 10.77   128.02   0.11
INTERNET.USERS.PC      7.86    7.86  9.07   101.18   0.05
RAILWAY.DENSITY        0.19    0.19  2.54     9.03   0.00
AREA.100              15.97   15.97  5.77    40.25   0.14
POPUL.100             41.28   41.28  7.89    64.97   0.35
GDP.PC.PPP.100         3.52    3.49  2.11     5.54   0.05
------------------------------------------------------------ 
group: 4
                    vars  n    mean      sd  median trimmed     mad     min
BIRTHS.1000.POPUL      1 50   11.92    3.52   11.50   11.44    2.89    6.70
LIFE.EXP.YEARS         2 50   79.90    3.21   80.45   80.11    2.67   71.90
ADULT.OBESITY          3 50   25.22   11.80   22.85   24.48    6.89    4.10
UNEMPL.RATE            4 50    7.01    6.24    5.07    5.87    3.60    1.00
MILITARY.EXP.TO.GDP    5 50    2.76    2.29    1.75    2.39    1.39    0.56
POPUL.DENSITY          6 50 1061.09 3396.98  185.81  238.15  242.30    0.03
ELECTR.CONS.PC         7 50 9589.21 5055.84 7901.53 8540.99 2344.45 5771.11
MOBILE.PH.PC           8 50    1.53    1.88    1.24    1.20    0.25    0.00
INTERNET.USERS.PC      9 50    0.95    1.09    0.86    0.85    0.10    0.00
RAILWAY.DENSITY       10 50    0.04    0.04    0.03    0.04    0.04    0.00
AREA.100              11 50    1.75    5.34    0.03    0.24    0.04    0.00
POPUL.100             12 50    0.60    1.60    0.08    0.19    0.12    0.00
GDP.PC.PPP.100        13 50    2.15    1.14    1.98    2.02    0.74    0.15
                         max    range  skew kurtosis     se
BIRTHS.1000.POPUL      23.10    16.40  1.33     1.93   0.50
LIFE.EXP.YEARS         86.00    14.10 -0.50    -0.27   0.45
ADULT.OBESITY          55.30    51.20  0.74     0.84   1.67
UNEMPL.RATE            36.00    35.00  2.54     8.01   0.88
MILITARY.EXP.TO.GDP     8.80     8.24  1.23     0.49   0.32
POPUL.DENSITY       21789.29 21789.26  4.85    25.38 480.41
ELECTR.CONS.PC      33609.45 27838.34  2.66     8.54 715.00
MOBILE.PH.PC           10.84    10.84  3.78    14.89   0.27
INTERNET.USERS.PC       8.14     8.14  5.69    34.78   0.15
RAILWAY.DENSITY         0.12     0.12  0.62    -1.03   0.01
AREA.100               28.46    28.46  3.51    12.25   0.75
POPUL.100               9.85     9.85  4.29    20.50   0.23
GDP.PC.PPP.100          5.79     5.64  1.26     1.89   0.16
In [19]:
# PAM
library(factoextra)

# for features
pam.f<-eclust(t(CIA.new), "pam", k=5) # factoextra::
fviz_cluster(pam.f, main="kmeans / Euclidean", ylim=c(-15, 5), xlim=c(-15, 55), labelsize=10, repel=TRUE)
fviz_silhouette(pam.f)
  cluster size ave.sil.width
1       1    8          0.80
2       2    1          0.00
3       3    1          0.00
4       4    1          0.00
5       5    2          0.71
In [20]:
# for countries
pam.c<-eclust(CIA.new, "pam", k=15) # factoextra::
fviz_cluster(pam.c, main="kmeans/Euclidean")
fviz_silhouette(pam.c)
   cluster size ave.sil.width
1        1   69          0.64
2        2   42          0.46
3        3   24          0.58
4        4    1          0.00
5        5   27          0.52
6        6   15          0.45
7        7    8          0.41
8        8    1          0.00
9        9    1          0.00
10      10   25          0.48
11      11    3          0.46
12      12    1          0.00
13      13    1          0.00
14      14    3          1.00
15      15    1          0.00
In [21]:
# MDS
install.packages("maptools")
library(maptools)

# for features
dist.f<-dist(t(CIA.new))
mds.f<-cmdscale(dist.f, k=2) 

# very basic plot
plot(mds.f[,1], mds.f[,2]) 

# fine-tuned plot
plot(jitter(mds.f[,1], amount=500000), jitter(mds.f[,2], amount=3000), xlim=c(-2500000, 5000000), ylim=c(-5000, 25000))
pointLabel(jitter(mds.f[,1], amount=500000), jitter(mds.f[,2], amount=3000), rownames(mds.f), cex=0.8) # smart labeling with maptools::
Installing package into ‘/srv/rlibs’
(as ‘lib’ is unspecified)

Loading required package: sp

Checking rgeos availability: FALSE
 	Note: when rgeos is not available, polygon geometry 	computations in maptools depend on gpclib,
 	which has a restricted licence. It is disabled by default;
 	to enable gpclib, type gpclibPermit()

In [22]:
# for countries
dist.c<-dist(CIA.new)
mds.c<-cmdscale(dist.c, k=2) 

# very basic plot
plot(mds.c[,1], mds.c[,2]) 

# fine-tuned plot
plot(jitter(mds.c[,1], amount=100000), jitter(mds.c[,2], amount=5000), xlim=c(-300000, 3000000), ylim=c(-5000, 25000))
x<-which(mds.c[,1]>1000000 | mds.c[,2]>10000)
pointLabel(jitter(mds.c[x,1], amount=500000), jitter(mds.c[x,2], amount=3000), countries[x], cex=0.8) # smart labeling with maptools::
In [23]:
# for countries
dist.c<-dist(conti.rep.PAM)
mds.c<-cmdscale(dist.c, k=2) 

# very basic plot
plot(mds.c[,1], mds.c[,2]) 

# fine-tuned plot
plot(jitter(mds.c[,1], amount=5), jitter(mds.c[,2], amount=5), xlim=c(-10000, 5000), ylim=c(-200, 200))
pointLabel(jitter(mds.c[,1], amount=5), jitter(mds.c[,2], amount=5), rownames(conti.rep.PAM), cex=0.8) # smart labeling with maptools::
In [24]:
# PCA
library(psych)

# for features
pca.f<-principal(CIA.new, nfactors=4, rotate="varimax")
pca.f
print(loadings(pca.f), digits=3, cutoff=0.4, sort=TRUE)
Principal Components Analysis
Call: principal(r = CIA.new, nfactors = 4, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
                      RC2   RC1   RC4   RC3   h2    u2 com
BIRTHS.1000.POPUL   -0.06 -0.71 -0.48 -0.09 0.75 0.251 1.8
LIFE.EXP.YEARS       0.07  0.78  0.46  0.04 0.82 0.182 1.6
ADULT.OBESITY        0.12  0.03  0.88 -0.09 0.80 0.203 1.1
UNEMPL.RATE         -0.03 -0.52  0.10 -0.21 0.32 0.676 1.4
MILITARY.EXP.TO.GDP -0.02  0.06  0.54  0.01 0.30 0.702 1.0
POPUL.DENSITY       -0.04  0.55 -0.30 -0.13 0.41 0.594 1.7
ELECTR.CONS.PC       0.94 -0.03  0.10 -0.03 0.89 0.105 1.0
MOBILE.PH.PC         0.98  0.03  0.02 -0.01 0.97 0.034 1.0
INTERNET.USERS.PC    0.98  0.06  0.00 -0.01 0.96 0.037 1.0
RAILWAY.DENSITY     -0.13  0.71 -0.01 -0.23 0.57 0.429 1.3
AREA.100            -0.04 -0.05  0.11  0.84 0.72 0.282 1.0
POPUL.100           -0.02  0.01 -0.14  0.83 0.70 0.297 1.1
GDP.PC.PPP.100       0.12  0.80  0.21 -0.02 0.70 0.304 1.2

                       RC2  RC1  RC4  RC3
SS loadings           2.87 2.82 1.69 1.52
Proportion Var        0.22 0.22 0.13 0.12
Cumulative Var        0.22 0.44 0.57 0.68
Proportion Explained  0.32 0.32 0.19 0.17
Cumulative Proportion 0.32 0.64 0.83 1.00

Mean item complexity =  1.3
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.08 
 with the empirical chi square  224.91  with prob <  7.4e-31 

Fit based upon off diagonal values = 0.92
Loadings:
                    RC2    RC1    RC4    RC3   
ELECTR.CONS.PC       0.940                     
MOBILE.PH.PC         0.982                     
INTERNET.USERS.PC    0.980                     
BIRTHS.1000.POPUL          -0.712 -0.481       
LIFE.EXP.YEARS              0.776  0.458       
UNEMPL.RATE                -0.516              
POPUL.DENSITY               0.546              
RAILWAY.DENSITY             0.707              
GDP.PC.PPP.100              0.799              
ADULT.OBESITY                      0.880       
MILITARY.EXP.TO.GDP                0.542       
AREA.100                                  0.839
POPUL.100                                 0.826

                 RC2   RC1   RC4   RC3
SS loadings    2.867 2.822 1.694 1.522
Proportion Var 0.221 0.217 0.130 0.117
Cumulative Var 0.221 0.438 0.568 0.685
In [25]:
# hierarchical

# for features
dist.f<-dist(t(CIA.new))
tree.f<-hclust(dist.f, method="complete")
plot(tree.f)

dist.fn<-dist(t(CIA.new[,c(-6, -7)]))
tree.fn<-hclust(dist.f, method="complete")
plot(tree.fn, main="no population and electricity consumption")
In [26]:
dend.f<-hcut(t(CIA.new[,c(-6, -7)]), k=5, stand=FALSE)
dend.f$cluster
fviz_dend(dend.f, rect=TRUE)
fviz_silhouette(dend.f)
fviz_cluster(dend.f)
BIRTHS.1000.POPUL
1
LIFE.EXP.YEARS
2
ADULT.OBESITY
1
UNEMPL.RATE
1
MILITARY.EXP.TO.GDP
3
MOBILE.PH.PC
4
INTERNET.USERS.PC
5
RAILWAY.DENSITY
3
AREA.100
3
POPUL.100
3
GDP.PC.PPP.100
3
  cluster size ave.sil.width
1       1    3          0.09
2       2    1          0.00
3       3    5          0.83
4       4    1          0.00
5       5    1          0.00
In [27]:
# for countries / continents

# we select medoids – most middle representants of world’regions
vec<-unique(continents)
meds<-matrix(0, ncol=1, nrow=10)
for(i in 1:10){
vec[i]
x<-which(continents==vec[i])
sub<-CIA.new[x,]
ppam<-eclust(sub, "pam", k=1) # factoextra::
meds[i,1]<-rownames(ppam$medoids)}
meds
rownames(meds)<-vec
conti.rep.PAM<-CIA.new[meds,]
rownames(conti.rep.PAM)<-vec

dist.c<-dist(conti.rep.PAM)
tree.c<-hclust(dist.c, method="complete")
plot(tree.c)
A matrix: 10 × 1 of type chr
187
180
47
217
8
27
202
210
215
203
In [28]:
# we use k-means – artificial parameters of centroid observation of world’ regions
vec<-unique(continents)
meds<-matrix(0, ncol=13, nrow=10)
for(i in 1:10){
vec[i]
x<-which(continents==vec[i])
sub<-CIA.new[x,]
kkm<-eclust(sub, "kmeans", k=1) # factoextra::
meds[i,]<-kkm$centers}
meds
rownames(meds)<-vec
conti.rep.km<-meds

dist.c<-dist(conti.rep.km)
tree.c<-hclust(dist.c, method="complete")
plot(tree.c)
fviz_dend(tree.c, k=3, cex=0.5, k_colors=c("#00AFBB","#E7B800","#FC4E07"),
          color_labels_by_k=TRUE, ggtheme=theme_minimal())
A matrix: 10 × 13 of type dbl
20.6250070.46250 5.73750 7.0912502.037500 467.36250 774.6350 1.0250000 0.31750000.0255250001.06832826.712128760.3254349
10.2591879.7489822.37551 6.9865311.720204 271.50122 93441.213540.916938837.14693880.0448918370.20390800.367905221.7967350
30.7211564.7288512.0615415.3973081.605192 99.84519 585.0156 0.8394231 0.26442310.0082750000.88376480.700600840.2611317
17.9950075.1100038.6350012.6590002.736000 113.15200244300.396042.472000028.15800000.0155050000.67393240.049932120.6192620
14.3903277.0258123.0161310.2967741.405806 238.30161 4652.1774 2.0793548 1.03967740.0323161290.03982390.085463900.9307914
15.5076975.6846223.52308 7.4600001.541538 21.47692 2349.7169 1.1307692 0.55692310.0133538462.27159120.969293380.7985111
17.8789576.1105327.1473715.1731584.634737 529.33053 5545.1295 1.2410526 0.76894740.0259473680.56691690.557784631.4643749
12.0333379.4166729.45000 6.3083332.173333 43.01667 21328.2183 0.8783333 2.14500000.0204666676.64269672.463092692.1868947
15.6777875.1555610.07222 3.4183331.8305562196.41889 3608.0256 1.4605556 0.57611110.0255500001.52862403.663809361.3560364
17.2000071.8000018.35000 5.1633332.150000 31.06000 3152.1283 1.3433333 0.48000000.0064833335.85323471.045083400.5974971
In [29]:
#########################################################################
#predictions
#########################################################################

# we have objects as follows: km.f, pam.f, mds.f, pca.f, tree.f
# we have new dataset as follows: originial – CIA.new & new one conti.rep.PAM & conti.rep.km

#prediction in kmeans
library(flexclust)

# for representatives from PAM
km.c<-eclust(CIA.new, "kmeans", hc_metric="euclidean",k=7)
km.c.kcca<-as.kcca(km.c, CIA.new) # conversion to kcca
km.c.p<-predict(km.c.kcca, conti.rep.PAM) # prediction for k-means
km.c.p

# for representatives from k-means
km.c<-eclust(CIA.new, "kmeans", hc_metric="euclidean",k=7)
km.c.kcca<-as.kcca(km.c, CIA.new) # conversion to kcca
km.c.p<-predict(km.c.kcca, conti.rep.km) # prediction for k-means
km.c.p

p1<-fviz_cluster(list(data=CIA.new, cluster=km.c$cluster), stand=F) + ggtitle("train")+ xlim(-3e+06, 1e+05) + ylim(-25000, 10000)
p2<-fviz_cluster(list(data=conti.rep.km, cluster=km.c.p),stand=F) + ggtitle("test")+ xlim(-3e+06, 1e+05) + ylim(-25000, 10000)
gridExtra::grid.arrange(p1, p2, nrow=2)
SOUTH ASIA
5
EUROPE
6
AFRICA
5
AUSTRALIA & OCEANIA
6
CENTRAL AMERICA
6
SOUTH AMERICA
6
MIDDLE EAST
6
NORTH AMERICA
4
EAST ASIA
5
CENTRAL ASIA
6
SOUTH ASIA
5
EUROPE
3
AFRICA
5
AUSTRALIA & OCEANIA
3
CENTRAL AMERICA
6
SOUTH AMERICA
6
MIDDLE EAST
6
NORTH AMERICA
2
EAST ASIA
6
CENTRAL ASIA
6
In [30]:
# prediction in pam
pam.c<-eclust(CIA.new, "pam", k=5) # factoextra::, class "pam", "partition", "eclust"   
pam.c.kcca<-as.kcca(pam.c, CIA.new) # conversion to kcca
pam.c.p<-predict(pam.c.kcca, conti.rep.km) # prediction for PAM
pam.c.p

p1<-fviz_cluster(list(data=CIA.new, cluster=pam.c$clustering), stand=F) + ggtitle("train") + xlim(-3e+06, 1e+05) + ylim(-25000, 10000)
p2<-fviz_cluster(list(data=conti.rep.km, cluster=pam.c.p),stand=F) + ggtitle("test") + xlim(-3e+06, 1e+05) + ylim(-25000, 10000)
gridExtra::grid.arrange(p1, p2, nrow=2)
SOUTH ASIA
1
EUROPE
3
AFRICA
1
AUSTRALIA & OCEANIA
3
CENTRAL AMERICA
3
SOUTH AMERICA
1
MIDDLE EAST
3
NORTH AMERICA
3
EAST ASIA
3
CENTRAL ASIA
1
In [31]:
# prediction in hierarchical tree
# here we run the tree on representatives from PAM, 
# while prediction is for representatives from K-means

library(factoextra)
library(class)

dist.c<-dist(conti.rep.PAM) # distance matrix
dist.c.hc<-hclust(dist.c, method="ward.D2") # dendrogram
fviz_dend(dist.c.hc, k=3, cex=0.5, k_colors=c("#00AFBB","#E7B800","#FC4E07"),
          color_labels_by_k=TRUE, ggtheme=theme_minimal())

groups<-cutree(dist.c.hc, k=3) # clustering vector
table(groups)
knnClust<-knn(train=conti.rep.PAM, test=conti.rep.km, k=1, cl=groups)
knnClust

p1<-fviz_cluster(list(data=conti.rep.PAM, cluster=groups), stand=F) + ggtitle("train")
p2<-fviz_cluster(list(data=conti.rep.km, cluster=knnClust),stand=F) + ggtitle("test")
gridExtra::grid.arrange(p1, p2, nrow=2)
groups
1 2 3 
7 2 1 
  1. 1
  2. 3
  3. 1
  4. 3
  5. 2
  6. 1
  7. 2
  8. 3
  9. 1
  10. 1
Levels:
  1. '1'
  2. '2'
  3. '3'
In [32]:
# prediction in PCA
pca.f<-prcomp(CIA.new, center=F, scale=F)
pca.f
pca.f.p<-predict(pca.f, newdata=conti.rep.PAM)
pca.f.p


library(psych)
pca.f<-principal(CIA.new, nfactors=3, rotate="varimax")
pca.f
Standard deviations (1, .., p=13):
 [1] 2.873986e+05 1.729151e+03 7.709725e+01 6.750026e+01 1.472707e+01
 [6] 1.265134e+01 9.786823e+00 7.016182e+00 4.261177e+00 2.420279e+00
[11] 1.494387e+00 7.048120e-01 2.567795e-02

Rotation (n x k) = (13 x 13):
                              PC1           PC2           PC3           PC4
BIRTHS.1000.POPUL   -8.313453e-06 -1.641183e-03  2.263710e-01 -1.388285e-02
LIFE.EXP.YEARS      -4.181649e-05 -9.917327e-03  9.191699e-01 -1.283968e-01
ADULT.OBESITY       -2.051932e-05 -1.590414e-03  2.658524e-01  5.844585e-03
UNEMPL.RATE         -5.112179e-06 -8.066162e-04  1.304185e-01 -1.300800e-02
MILITARY.EXP.TO.GDP -1.311451e-06 -3.840949e-04  2.491379e-02 -1.908404e-03
POPUL.DENSITY       -6.148592e-05 -9.999476e-01 -1.008920e-02  8.243245e-04
ELECTR.CONS.PC      -9.999999e-01  6.173648e-05  8.562762e-06  4.456592e-04
MOBILE.PH.PC        -3.455189e-04  3.274053e-04 -9.442325e-02 -7.515978e-01
INTERNET.USERS.PC   -2.792452e-04  3.546338e-04 -7.807373e-02 -6.466766e-01
RAILWAY.DENSITY     -3.046196e-09 -8.345674e-06  2.934001e-04 -4.354555e-05
AREA.100            -6.086022e-08 -1.904574e-05  1.318409e-02 -1.922407e-03
POPUL.100           -3.449410e-08 -7.314483e-05  1.244988e-02 -2.069321e-03
GDP.PC.PPP.100      -8.801522e-07 -3.546726e-04  1.156646e-02 -3.970426e-03
                              PC5           PC6           PC7           PC8
BIRTHS.1000.POPUL   -7.405726e-02  6.967211e-01  1.991269e-01  6.440821e-01
LIFE.EXP.YEARS       7.826793e-02 -1.044440e-01  2.160290e-01 -2.660684e-01
ADULT.OBESITY       -1.456659e-01 -4.845276e-01 -5.656174e-01  5.808719e-01
UNEMPL.RATE         -1.508343e-01  5.122304e-01 -7.379855e-01 -3.906907e-01
MILITARY.EXP.TO.GDP -1.377447e-02 -1.126756e-02 -3.288719e-02  3.929068e-02
POPUL.DENSITY       -2.521629e-04  2.663659e-04 -9.915301e-04  9.959073e-04
ELECTR.CONS.PC       1.797186e-05  9.383511e-06  1.292322e-05 -1.809492e-05
MOBILE.PH.PC        -6.397368e-01 -4.908347e-02  1.135781e-01 -3.842231e-02
INTERNET.USERS.PC    7.312269e-01  4.835758e-02 -1.696659e-01  9.716214e-02
RAILWAY.DENSITY      5.298076e-05 -7.483350e-04  2.169591e-04 -1.057463e-03
AREA.100             5.802663e-05 -8.684351e-03  2.134680e-02 -1.048359e-02
POPUL.100            4.847376e-03  1.889496e-02  7.684774e-02 -1.058799e-01
GDP.PC.PPP.100       1.583835e-02 -3.488015e-02  1.977227e-03 -2.305175e-02
                              PC9          PC10          PC11          PC12
BIRTHS.1000.POPUL   -3.415818e-02 -1.330851e-02  1.091633e-02 -4.122058e-02
LIFE.EXP.YEARS       5.821030e-02  9.241136e-03  7.989087e-03  2.256970e-02
ADULT.OBESITY       -1.003501e-01 -4.896621e-02  5.798876e-02 -1.159499e-02
UNEMPL.RATE         -1.096955e-02  1.015749e-02  6.819184e-03 -1.258508e-02
MILITARY.EXP.TO.GDP -4.326300e-02  1.795417e-02 -9.772369e-01  1.981070e-01
POPUL.DENSITY       -2.722143e-04  1.394575e-07  2.498269e-04  1.464388e-04
ELECTR.CONS.PC       1.900316e-06  1.503230e-06 -9.456387e-07 -1.524575e-07
MOBILE.PH.PC         9.870807e-03  1.820087e-03  1.718982e-03 -5.691937e-03
INTERNET.USERS.PC   -1.870712e-02 -5.525703e-03  5.025325e-04  8.639753e-03
RAILWAY.DENSITY      1.257209e-03 -1.164191e-03 -1.055832e-03 -9.816834e-03
AREA.100            -5.035061e-01  8.624358e-01  4.048160e-02  1.501974e-02
POPUL.100           -8.539565e-01 -5.019767e-01  1.578646e-02 -2.954066e-02
GDP.PC.PPP.100       1.317310e-02  3.321229e-02 -1.988050e-01 -9.782385e-01
                             PC13
BIRTHS.1000.POPUL    7.305171e-04
LIFE.EXP.YEARS      -5.183434e-04
ADULT.OBESITY        3.209065e-04
UNEMPL.RATE          8.716131e-06
MILITARY.EXP.TO.GDP  1.021915e-03
POPUL.DENSITY       -1.824582e-06
ELECTR.CONS.PC      -4.639839e-09
MOBILE.PH.PC        -1.374909e-04
INTERNET.USERS.PC    2.341959e-04
RAILWAY.DENSITY      9.999489e-01
AREA.100             1.801161e-03
POPUL.100            9.738446e-05
GDP.PC.PPP.100      -9.846825e-03
A matrix: 10 × 13 of type dbl
PC1PC2PC3PC4PC5PC6PC7PC8PC9PC10PC11PC12PC13
SOUTH ASIA -553.5654-349.6221072.87665-10.730116 2.9550710 1.783889612.7901983-10.0248302 2.6535149 0.1285416-0.750905057 0.9231537502-8.083399e-03
EUROPE -4896.5308-111.4658078.51197 -9.345230 1.4410842 -9.0352723 3.3037780 -4.6886861 1.8406335-0.2859976 0.003543161 0.0811643626-3.144829e-02
AFRICA -227.2589 -85.8810965.91614 -9.529000-0.8808303 13.6660579 6.3157558 4.7608943 0.4097931-0.1497680 0.419886211-0.0022239277-4.802813e-03
AUSTRALIA & OCEANIA -5529.9506-112.1868186.76440 -7.825857-1.4383124-13.3193322-8.1551883 5.4706996 0.2685965-1.1611507 1.953435656 0.8151849161-1.111804e-03
CENTRAL AMERICA -3135.1078-222.4459678.65034-10.079514-0.6761920 -0.9241077 0.9363091 -3.8365001 1.8662523-0.2340077 0.328312054-0.0035954938 3.049745e-02
SOUTH AMERICA -2404.6456 -25.5191479.10657 -9.929423-0.2584807 -2.9543670-1.7184809 -3.6991908-11.0159294 8.6738323 1.191953423 0.4125897971 5.752134e-05
MIDDLE EAST -2817.7007-105.3284582.17974 -9.763239-1.9848985 -6.1258045-8.8740420 2.5543006 -2.2947226-0.9514896 0.767459984-0.2066456568-2.311917e-02
NORTH AMERICA-11730.4358 -33.9839886.77575 -6.614633-0.4080003-15.2758316-2.4191928 4.8493073-16.1824688 8.1721726-0.116115063-0.9647550026 1.834182e-02
EAST ASIA -1450.5721-298.7228469.55596-10.401139 3.5212016 2.979534915.4034686-10.3992324 0.6434725-0.5170163-1.264726567 1.1562604843-2.695048e-02
CENTRAL ASIA -2729.4346 -11.9366575.85091 -9.570464-0.9772309 1.8413321 0.5717895 -0.7530771 0.9614484 0.2872300 0.638407884-0.0007870392-1.197868e-02
Principal Components Analysis
Call: principal(r = CIA.new, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
                      RC1   RC2   RC3   h2    u2 com
BIRTHS.1000.POPUL   -0.85 -0.07 -0.08 0.73 0.268 1.0
LIFE.EXP.YEARS       0.90  0.08  0.02 0.81 0.189 1.0
ADULT.OBESITY        0.39  0.22 -0.05 0.20 0.800 1.6
UNEMPL.RATE         -0.43  0.01 -0.19 0.22 0.777 1.4
MILITARY.EXP.TO.GDP  0.28  0.03  0.03 0.08 0.920 1.1
POPUL.DENSITY        0.37 -0.10 -0.17 0.17 0.826 1.6
ELECTR.CONS.PC       0.02  0.94 -0.03 0.89 0.106 1.0
MOBILE.PH.PC         0.04  0.97 -0.01 0.95 0.049 1.0
INTERNET.USERS.PC    0.06  0.97 -0.01 0.94 0.061 1.0
RAILWAY.DENSITY      0.63 -0.17 -0.27 0.50 0.501 1.5
AREA.100             0.02 -0.03  0.84 0.71 0.286 1.0
POPUL.100           -0.03 -0.04  0.82 0.67 0.331 1.0
GDP.PC.PPP.100       0.81  0.09 -0.05 0.67 0.328 1.0

                       RC1  RC2  RC3
SS loadings           3.14 2.89 1.53
Proportion Var        0.24 0.22 0.12
Cumulative Var        0.24 0.46 0.58
Proportion Explained  0.42 0.38 0.20
Cumulative Proportion 0.42 0.80 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09 
 with the empirical chi square  282.62  with prob <  2.1e-37 

Fit based upon off diagonal values = 0.9
In [33]:
pca.f.p<-predict.psych(pca.f, conti.rep.PAM , CIA.new)
pca.f.p
A matrix: 10 × 3 of type dbl
RC1RC2RC3
SOUTH ASIA 0.05134626-0.222572538-0.04910072
EUROPE 0.43517573-0.073893955-0.02737097
AFRICA-1.32668136-0.131189376-0.10603264
AUSTRALIA & OCEANIA 0.29927736-0.020748963-0.21695124
CENTRAL AMERICA 0.41015703-0.225487315-0.47920896
SOUTH AMERICA 0.02084484-0.063125486 3.29355097
MIDDLE EAST 0.11574677 0.007638035 0.35034098
NORTH AMERICA 1.30368501-0.044792743 4.15573096
EAST ASIA-0.26706754-0.181728001 0.47612369
CENTRAL ASIA-0.35128670-0.092945731-0.08368463